Abstract

We propose a Bayesian approach to learn discriminative
dictionaries for sparse representation of data. The proposed
approach infers probability distributions over the atoms of a
discriminative dictionary using a Beta Process. It also computes
sets of Bernoulli distributions that associate class labels
with the learned dictionary atoms. This association signifies the
selection probabilities of the dictionary atoms in the expansion
of class-specific data. Furthermore, the non-parametric
character of the proposed approach allows it to infer the
correct size of the dictionary. We exploit the aforementioned
Bernoulli distributions in separately learning a linear classifier.
The classifier uses the same hierarchical Bayesian model as
the dictionary, which we present along with the analytical inference
solution for Gibbs sampling. For classification, a test instance
is first sparsely encoded over the learned dictionary and the
codes are fed to the classifier. We performed experiments on
face and action recognition, and on object and scene-category
classification, using five public datasets, and compared the
results with state-of-the-art discriminative sparse representation
approaches. Experiments show that the proposed Bayesian
approach consistently outperforms the existing approaches.

Index Terms —Bayesian sparse representation, Discriminative 
dictionary learning, Supervised learning, Classification. 


I. Introduction 


Sparse representation encodes a signal as a sparse linear
combination of redundant basis vectors. With its inspirational
roots in the human vision system (16), this technique has
been successfully employed in image restoration,
compressive sensing (22) and morphological component
analysis (23). More recently, sparse representation
based approaches have also shown promising results in face
recognition and gender classification (26), texture and
handwritten digit classification (14), (30), (31), natural image
and object classification (9), (32) and human action recognition
(33), (34), (35), (36).


The success of these approaches comes from the fact that a 
sample from a class can generally be well represented as a 
sparse linear combination of the other samples from the same 
class, in a lower dimensional manifold.

For classification, a discriminative sparse representation
approach first encodes the test instance over a dictionary, i.e.
a redundant set of basis vectors, known as atoms. Therefore,
an effective dictionary is critical for the performance of such
approaches. It is possible to use an off-the-shelf basis (e.g.
fast Fourier transform or wavelets (42)) as a generic
dictionary to represent data from different domains/classes.
However, research in the last decade (e.g. (45)) has provided
strong evidence in favor of
learning dictionaries using the domain/class-specific training
data, especially for classification and recognition tasks,
where class label information of the training data can be
exploited in the supervised learning of a dictionary.

Whereas unsupervised dictionary learning approaches (e.g.
K-SVD [6], Method of Optimal Directions (46)) aim at
learning faithful signal representations, supervised sparse
representation additionally strives to make the dictionaries
discriminative. For instance, in the Sparse Representation based
Classification (SRC) scheme, Wright et al. [8] constructed
a discriminative dictionary by directly using the training
data as the dictionary atoms. With each atom associated with
a particular class, the query is assigned the label of the
class whose associated atoms maximally contribute to the
sparse representation of the query. Impressive results have
been achieved for recognition and classification using SRC.
However, the computational complexity of this technique
becomes prohibitive for large training data. This has motivated
considerable research on learning discriminative dictionaries
that would allow sparse representation based classification
with much lower computational cost.

In order to learn a discriminative dictionary, existing
approaches either force subsets of the dictionary atoms to
represent data from only specific classes, or
they associate the complete dictionary with all the classes and
constrain their sparse coefficients to be discriminative
(28). A third category of techniques learns exclusive sets of
class-specific and common dictionary atoms to separate the
common and particular features of the data from different
classes. Establishing an association between the dictionary
atoms and the corresponding class labels is a key
step of existing methods. However, adaptively building this
association is still an open research problem. Moreover,
the strategy of assigning different numbers of dictionary atoms
to different classes and adjusting the overall size of the
dictionary becomes critical for the classification accuracy of
the existing approaches, as no principled approach is generally
provided to predetermine these parameters.

In this work, we propose a solution to this problem by 
approaching the sparse representation based classification from 
a non-parametric Bayesian perspective. We propose a Bayesian 
Fig. 1: A schematic diagram of the proposed approach: For training, a set of probability distributions over the dictionary
atoms, i.e. $\aleph$, is learned. We also infer sets of Bernoulli distributions indicating the probabilities of selection of the dictionary
atoms in the expansion of data from each class. These distributions are used for inferring the support of the sparse codes. The
(parameters of) Bernoulli distributions are later used for learning a classifier. The final dictionary is learned by sampling the
distributions in $\aleph$, whereas the sparse codes are computed as the element-wise product of the support and the weights (inferred by
the approach) of the codes. Combined, the dictionary and the codes faithfully represent the training data. For testing, sparse
codes of the query over the dictionary are computed and fed to the classifier for labeling.


sparse representation technique that infers a discriminative
dictionary using a Beta Process [56]. Our approach adaptively
builds the association between the dictionary atoms and the
class labels such that this association signifies the probability
of selection of the dictionary atoms in the expansion of
class-specific data. Furthermore, the non-parametric character of
the approach allows it to automatically infer the correct size
of the dictionary. The scheme employed by our approach
is shown in Fig. 1. We perform Bayesian inference over a
model proposed for discriminative sparse representation of
the training data. The inference process learns distributions
over the dictionary atoms and sets of Bernoulli distributions
associating the dictionary atoms with the labels of the data.
The Bernoulli distributions govern the support of the final
sparse codes and are later utilized in learning a multi-class
linear classifier. The final dictionary is learned by sampling the
distributions over the dictionary atoms and the corresponding
sparse codes are computed by element-wise product of the
support and the inferred weights of the codes. The computed
dictionary and the sparse codes also represent the training data
faithfully.

A query is classified in our approach by first sparsely
encoding it over the inferred dictionary and then classifying its
sparse code with the learned classifier. In this work, we learn
the classifier and the dictionary using the same hierarchical
Bayesian model. This allows us to exploit the aforementioned
Bernoulli distributions in accurately estimating the classifier.
We present the proposed Bayesian model along with its inference
equations for Gibbs sampling. Our approach has been tested
on two face databases, an object database, an
action database and a scene database. The classification
results are compared with the state-of-the-art discriminative
sparse representation approaches. The proposed approach not
only outperforms these approaches in terms of accuracy, but its
computational efficiency for the classification stage is also
comparable to the most efficient existing approaches.

This paper is organized as follows. We review the related
work in Section II. In Section III, we formulate the
problem and briefly explain the relevant concepts that clarify
the rationale behind our approach. The proposed approach is
presented in Section IV, which includes details of the proposed
model, the Gibbs sampling process, the classification scheme
and the initialization of the proposed approach. Experimental
results are reported in Section V and a discussion on the
parameter settings is provided in Section VI. We draw
conclusions in Section [


II. Related Work 

There are three main categories of approaches that learn
discriminative sparse representation. In the first category, the
learned dictionary atoms have direct correspondence to the
labels of the classes (26), (47), (12), (48), (35), (49), (36).
Yang et al. (26) proposed an SRC-like framework for face
recognition, where the atoms of the dictionary are learned
from the training data instead of directly using the training
data as the dictionary. In order to learn a dictionary that
is simultaneously discriminative and reconstructive, Mairal
et al. (47) used a discriminative penalty term in the K-SVD
model (6), achieving state-of-the-art results on texture
segmentation. Sprechmann and Sapiro (48) also proposed to
learn dictionaries and sparse codes for clustering. In (36),
Castrodad and Sapiro computed class-specific dictionaries for
actions. The dictionary atoms and their sparse coefficients also
exploited the non-negativity of the signals in their approach.
Active basis models are learned from the training images of
each class and applied to object detection and recognition in
(49). Ramirez et al. have used an incoherence promoting
term for the dictionary atoms in their learning model. Encouraging
incoherence among the class-specific sub-dictionaries
allowed them to represent samples from the same class better
than the samples from the other classes. Wang et al. (35) have
proposed to learn class-specific dictionaries for modeling
individual actions for action recognition. Their model incorporated
a similarity constrained term and a dictionary incoherence term
for classification. The above mentioned methods mainly associate
a dictionary atom directly with a single class. Therefore,
a query is generally assigned the label of the class whose
associated atoms result in the minimum representational error
for the query. The classification stages of the approaches under
this category often require the computation of representations
of the query over many sub-dictionaries.

In the second category of the approaches for discriminative
sparse representation, a single dictionary is shared by all the
classes; however, the representation coefficients are forced to
be discriminative ((9), (28), (29), (30), (45), (31), (50),
(33), (51)). Jiang et al. (9) proposed a dictionary learning
model that encourages the sparse representation coefficients
of the same class to be similar. This is done by adding
a 'discriminative sparse-code error' constraint to a unified
objective function that already contains reconstruction error
and classification error constraints. A similar approach is
taken by Rodriguez and Sapiro (30), where the authors solve
a simultaneous sparse approximation problem (52) while
learning the coefficients. It is common to learn dictionaries
jointly with a classifier. Pham and Venkatesh (45) and Mairal
et al. (28) proposed to train linear classifiers along with the joint
dictionaries learned for all the classes. Zhang and Li [7]
enhanced the K-SVD algorithm [6] to learn a linear classifier
along with the dictionary. A task driven dictionary learning
framework has also been proposed (31). Under this framework,
different risk functions of the representation coefficients are
minimized for different tasks. Broadly speaking, the above
mentioned approaches aim at learning a single dictionary
together with a classifier. The query is classified by directly
feeding its sparse codes over the learned single dictionary to
the classifier. Thus, in comparison to the approaches in the
first category, the classification stage of these approaches is
computationally more efficient. In terms of learning a single
dictionary for the complete training data and the classification
stage, the proposed approach also falls under this category of
discriminative sparse representation techniques.

The third category takes a hybrid approach for learning the
discriminative sparse representation. In these approaches, the
dictionaries are designed to have a set of shared atoms in addition
to class-specific atoms. Deng et al. (53) extended the SRC
algorithm by appending an intra-class face variation dictionary
to the training data. This extension achieves promising results
in face recognition with a single training sample per class.
Zhou and Fan (54) employ a Fisher-like regularizer on the
representation coefficients while learning a hybrid dictionary.
Wang and Kong learned a hybrid dictionary to separate
the common and particular features of the data. Their approach
additionally encouraged the class-specific dictionaries to be
incoherent during the optimization process. Shen et al. (55)
proposed to learn a multi-level dictionary for hierarchical
visual categorization. To some extent, it is possible to reduce
the size of the dictionary using the hybrid approach, which
also reduces the classification time in comparison
to the approaches that fall under the first category. However,
it is often non-trivial to decide how to balance
the shared and the class-specific parts of the hybrid dictionary.



III. Problem Formulation and Background 

Let $\mathbf{X} = [\mathbf{X}^1, \dots, \mathbf{X}^c, \dots, \mathbf{X}^C] \in \mathbb{R}^{m \times N}$ be the training
data comprising $N$ instances from $C$ classes, wherein
$\mathbf{X}^c \in \mathbb{R}^{m \times N_c}$ represents the data from the $c$-th class and
$\sum_{c=1}^{C} N_c = N$. The columns of $\mathbf{X}^c$ are indexed in $\mathcal{I}_c$. We
denote a dictionary by $\Phi \in \mathbb{R}^{m \times |\mathcal{K}|}$ with atoms $\varphi_k$, where
$k \in \mathcal{K} = \{1, \dots, K\}$ and $|.|$ represents the cardinality of the
set. Let $\mathbf{A} \in \mathbb{R}^{|\mathcal{K}| \times N}$ be the sparse code matrix of the data,
such that $\mathbf{X} \approx \Phi\mathbf{A}$. We can write $\mathbf{A} = [\mathbf{A}^1, \dots, \mathbf{A}^c, \dots, \mathbf{A}^C]$,
where $\mathbf{A}^c \in \mathbb{R}^{|\mathcal{K}| \times N_c}$ is the sub-matrix related to the $c$-th
class. The $i$-th column of $\mathbf{A}$ is denoted as $\alpha_i \in \mathbb{R}^{|\mathcal{K}|}$. To learn
a sparse representation of the data, we can solve the following
optimization problem:

$$\langle \Phi, \mathbf{A} \rangle = \min_{\Phi, \mathbf{A}} \|\mathbf{X} - \Phi\mathbf{A}\|_F^2 \quad \text{s.t.} \quad \forall i, \ \|\alpha_i\|_p \le t, \tag{1}$$

where $t$ is a predefined constant, $\|.\|_F$ computes the Frobenius
norm and $\|.\|_p$ denotes the $\ell_p$-norm of a vector. Generally,
$p$ is chosen to be 0 or 1 for sparsity (57). The non-convex
optimization problem of Eq. (1) can be iteratively solved
by fixing one parameter and solving a convex optimization
problem for the other parameter in each iteration. The solution
to Eq. (1) factors the training data $\mathbf{X}$ into two complementary
matrices, namely the dictionary and the sparse codes,
without considering the class label information of the training
data. Nevertheless, we can still exploit this factorization in
classification tasks by using the sparse codes of the data as
features (9), for which a classifier can be obtained as

$$\mathbf{W} = \min_{\mathbf{W}} \sum_{i=1}^{N} \mathcal{L}\{h_i, f(\alpha_i, \mathbf{W})\} + \lambda \|\mathbf{W}\|_F^2, \tag{2}$$

where $\mathbf{W} \in \mathbb{R}^{C \times |\mathcal{K}|}$ contains the model parameters of the
classifier, $\mathcal{L}$ is the loss function, $h_i$ is the label of the
$i$-th training instance $\mathbf{x}_i \in \mathbb{R}^m$ and $\lambda$ is the regularization
parameter.
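As a minimal sketch of Eq. (2), assume a quadratic loss and a linear model $f(\alpha_i, \mathbf{W}) = \mathbf{W}\alpha_i$ with one-hot labels. Under these assumptions (which the text above does not impose) the minimizer has a closed form. The function names `ridge_classifier` and `predict`, and the toy data, are hypothetical.

```python
import numpy as np

def ridge_classifier(A, H, lam=1.0):
    """Closed-form minimizer of ||H - W A||_F^2 + lam * ||W||_F^2.

    A: (K, N) sparse codes, one column per training instance.
    H: (C, N) one-hot label matrix. Returns W with shape (C, K).
    """
    K = A.shape[0]
    # Normal equations: W (A A^T + lam I) = H A^T; the Gram matrix is symmetric.
    G = A @ A.T + lam * np.eye(K)
    return np.linalg.solve(G, A @ H.T).T

def predict(W, alpha):
    """Assign the class with the largest linear response to the code alpha."""
    return int(np.argmax(W @ alpha))

# Toy data (assumed for illustration): two classes that use disjoint atoms.
A = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.0, 0.0, 1.1, 0.8]])
H = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
W = ridge_classifier(A, H, lam=0.1)
```

Other loss functions in Eq. (2) would generally require an iterative solver instead of this closed form.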

It is usually suboptimal to perform classification based on
sparse codes learned by an unsupervised technique. Considering
this, existing approaches (45), (29), (28) proposed to
jointly optimize a classifier with the dictionary while learning
the sparse representation. One intended ramification of this
approach is that the label information also gets induced into
the dictionary. This happens when the information is utilized
in computing the sparse codes of the data, which, in turn,
are used for computing the dictionary atoms while solving
Eq. (1). This results in improving the discriminative abilities
of the learned dictionary. Jiang et al. (9) built further on
this concept and encouraged explicit correspondence between
the dictionary atoms and the class labels. More precisely,
the following optimization problem is solved by the
Label-Consistent K-SVD (LC-KSVD2) algorithm:

$$\langle \Phi, \mathbf{W}, \mathbf{T}, \mathbf{A} \rangle = \min_{\Phi, \mathbf{W}, \mathbf{T}, \mathbf{A}} \left\| \begin{pmatrix} \mathbf{X} \\ \sqrt{\nu}\,\mathbf{Q} \\ \sqrt{\kappa}\,\mathbf{H} \end{pmatrix} - \begin{pmatrix} \Phi \\ \sqrt{\nu}\,\mathbf{T} \\ \sqrt{\kappa}\,\mathbf{W} \end{pmatrix} \mathbf{A} \right\|_F^2 \quad \text{s.t.} \quad \forall i \ \|\alpha_i\|_0 \le t, \tag{3}$$

where $\nu$ and $\kappa$ are the regularization parameters, the binary
matrix $\mathbf{H} \in \mathbb{R}^{C \times N}$ contains the class label information¹,
$\mathbf{T} \in \mathbb{R}^{|\mathcal{K}| \times |\mathcal{K}|}$ is the transformation between the sparse codes and
the discriminative sparse codes $\mathbf{Q} \in \mathbb{R}^{|\mathcal{K}| \times N}$. Here, for the
$i$-th training instance, the $i$-th column of the fixed binary matrix
$\mathbf{Q}$ has 1 appearing at the $k$-th index only if the $k$-th dictionary
atom has the same class label as the training instance. Thus,
the discriminative sparse codes form a pre-defined relationship
between the dictionary atoms and the class labels. This brings
improvement to the discriminative abilities of the dictionary
learned by solving Eq. (3).
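The two fixed binary matrices of Eq. (3) follow directly from the labels. The sketch below builds them for integer labels starting at 0. The function name `label_matrices` and the assumption that each atom carries a pre-assigned class label are illustrative.

```python
import numpy as np

def label_matrices(instance_labels, atom_labels, num_classes):
    """Build the label matrix H (C x N) and the discriminative codes Q (K x N).

    H[c, i] = 1 iff instance i belongs to class c.
    Q[k, i] = 1 iff atom k and instance i share the same class label.
    """
    instance_labels = np.asarray(instance_labels)
    atom_labels = np.asarray(atom_labels)
    N = instance_labels.size
    H = np.zeros((num_classes, N))
    H[instance_labels, np.arange(N)] = 1.0
    Q = (atom_labels[:, None] == instance_labels[None, :]).astype(float)
    return H, Q

# Three instances (classes 0, 0, 1) and four atoms (classes 0, 1, 1, 0).
H, Q = label_matrices([0, 0, 1], [0, 1, 1, 0], num_classes=2)
```

The need to fix `atom_labels` in advance is the pre-defined association that the text contrasts with an adaptively learned one.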

It is worth noting that in the Label-Consistent K-SVD algorithm
(9), the relationship between class-specific subsets of
dictionary atoms and class labels is pre-defined. However,
regularization allows flexibility in this association during
optimization. We also note that using $\nu = 0$ in Eq. (3) reduces the
optimization problem to the one solved by the Discriminative
K-SVD (D-KSVD) algorithm [7]. Successful results are achievable
using the above mentioned techniques for recognition
and classification. However, like any discriminative sparse
representation approach, these results are obtainable only after
careful optimization of the algorithm parameters, including
the dictionary size. In Fig. 2, we illustrate the behavior of
recognition accuracy under varying dictionary sizes for [7]
and (9) for two face databases.

¹ For the $i$-th training instance, the $i$-th column of $\mathbf{H}$ has 1 appearing only at
the index corresponding to the class label.

Paisley and Carin (56) developed a Beta Process for
non-parametric factor analysis, which was later used by Zhou
et al. (44) in successful image restoration and compressive
sensing. Exploiting the non-parametric Bayesian framework,
a Beta Process can automatically infer the factor/dictionary
size from the training data. With the base measure $\hbar_o$ and
parameters $a_o > 0$ and $b_o > 0$, a Beta Process is denoted
by $\text{BP}(a_o, b_o, \hbar_o)$. A draw from this process, i.e.
$\hbar \sim \text{BP}(a_o, b_o, \hbar_o)$, can be represented as

$$\hbar = \sum_{k} \pi_k \delta_{\varphi_k}(\varphi), \quad k \in \mathcal{K} = \{1, \dots, K\},$$
$$\pi_k \sim \text{Beta}(\pi_k | a_o/K, \ b_o(K-1)/K), \tag{4}$$

with this a valid measure as $K \to \infty$. In the above equation,
$\delta_{\varphi_k}(\varphi)$ is 1 when $\varphi = \varphi_k$ and 0 otherwise. Therefore, $\hbar$ can
be represented as a set of $|\mathcal{K}|$ probabilities, each having an
associated vector $\varphi_k$, drawn i.i.d. from the base measure $\hbar_o$.
Using $\hbar$, we can draw a binary vector $\mathbf{z}_i \in \{0,1\}^{|\mathcal{K}|}$, such that
the $k$-th component of this vector is drawn as $z_{ik} \sim \text{Bernoulli}(\pi_k)$.
By independently drawing $N$ such vectors, we may construct
a matrix $\mathbf{Z} \in \{0,1\}^{|\mathcal{K}| \times N}$, where $\mathbf{z}_i$ is the $i$-th column of this
matrix.
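The construction in Eq. (4) can be sketched with a finite truncation level $K$. In the sketch below, the standard Gaussian base measure, the function names and all numerical values are assumptions made only for illustration.

```python
import numpy as np

def draw_beta_process(a0, b0, K, m, rng):
    """Finite-K approximation of a Beta Process draw as in Eq. (4).

    Returns the K selection probabilities pi_k and an (m, K) matrix whose
    columns are the associated vectors, drawn i.i.d. from the base measure.
    """
    pi = rng.beta(a0 / K, b0 * (K - 1) / K, size=K)
    Phi = rng.standard_normal((m, K))  # assumed base measure: N(0, I_m)
    return pi, Phi

def draw_binary_codes(pi, N, rng):
    """Draw Z in {0,1}^(K x N) with entries z_ik ~ Bernoulli(pi_k)."""
    return (rng.random((pi.size, N)) < pi[:, None]).astype(int)

rng = np.random.default_rng(0)
pi, Phi = draw_beta_process(a0=2.0, b0=1.0, K=500, m=8, rng=rng)
Z = draw_binary_codes(pi, N=200, rng=rng)
# Average number of active atoms per column; small compared to K.
mean_active = Z.sum(axis=0).mean()
```

Most of the drawn probabilities are close to zero, so only a few rows of `Z` are ever active. This is what allows the effective dictionary size to be inferred from data.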

Using the above mentioned Beta Process, it is possible to
factorize $\mathbf{X} \in \mathbb{R}^{m \times N}$ as follows:

$$\mathbf{X} = \Phi\mathbf{Z} + \mathbf{E}, \tag{5}$$

where $\Phi \in \mathbb{R}^{m \times |\mathcal{K}|}$ has $\varphi_k$ as its columns and $\mathbf{E} \in \mathbb{R}^{m \times N}$
is the error matrix. In Eq. (5), the number of non-zero
components in a column of $\mathbf{Z}$ is a random number drawn
from $\text{Poisson}(a_o/b_o)$ [56]. Thus, sparsity can be imposed on
the representation with the help of parameters $a_o$ and $b_o$. The
components of the $k$-th row of $\mathbf{Z}$ are independent draws from
$\text{Bernoulli}(\pi_k)$. Let $\pi \in \mathbb{R}^{|\mathcal{K}|}$ be a vector with $\pi_{k \in \mathcal{K}}$ as its
$k$-th element. This vector governs the probability of selection
of the columns of $\Phi$ in the expansion of the data. The existence
of this physically meaningful latent vector in the Beta Process
based matrix factorization plays a central role in the proposed
approach for discriminative dictionary learning.

IV. Proposed Approach 

We propose a Discriminative Bayesian Dictionary Learning
approach for classification. For the $c$-th class, the proposed
approach draws $|\mathcal{I}_c|$ binary vectors $\mathbf{z}_i^c \in \mathbb{R}^{|\mathcal{K}|}$, $\forall i \in \mathcal{I}_c$,
using a Beta Process. For each class, the vectors are sampled
using separate draws with the same base. That is, the matrix
factorization is governed by a set of $C$ probability vectors
$\pi^{c \in \{1, \dots, C\}}$, instead of a single vector; however, the inferred
dictionary is shared by all the classes. An element of the
aforementioned set, i.e. $\pi^c \in \mathbb{R}^{|\mathcal{K}|}$, controls the probability of
selection of the dictionary atoms for the data of a single class. This
promotes the discriminative abilities of the inferred dictionary.
Fig. 2: Examples of how recognition accuracy is affected by varying dictionary size: $\kappa = 0$ for LC-KSVD1 and $\nu = 0$
for D-KSVD in Eq. (3). All other parameters are kept constant at optimal values reported in [9]. For the AR database, 2000
training instances are used and testing is performed with 600 instances. For the Extended YaleB, half of the database is used
for training and the other half is used for testing. The instances are selected uniformly at random.


A. The Model 

Let $\alpha_i^c \in \mathbb{R}^{|\mathcal{K}|}$ denote the sparse code of the $i$-th training
instance of the $c$-th class, i.e. $\mathbf{x}_i^c \in \mathbb{R}^m$, over a dictionary
$\Phi \in \mathbb{R}^{m \times |\mathcal{K}|}$. Mathematically, $\mathbf{x}_i^c = \Phi\alpha_i^c + \epsilon_i$, where $\epsilon_i \in \mathbb{R}^m$
denotes the modeling error. We can directly use the Beta
Process discussed in Section III for computing the desired
sparse code and the dictionary. However, the model employed
by the Beta Process is restrictive, as it only allows the code to
be binary. To overcome this restriction, let $\alpha_i^c = \mathbf{z}_i^c \odot \mathbf{s}_i^c$, where
$\odot$ denotes the Hadamard/element-wise product, $\mathbf{z}_i^c \in \mathbb{R}^{|\mathcal{K}|}$ is
the binary vector and $\mathbf{s}_i^c \in \mathbb{R}^{|\mathcal{K}|}$ is a weight vector. We place a
zero-mean normal prior $\mathcal{N}(s_{ik}^c | 0, 1/\lambda_{s_o}^c)$ on the $k$-th component
of the weight vector $s_{ik}^c$, where $\lambda_{s_o}^c$ denotes the precision
of the distribution. Here, as in the following text, we use
the subscript 'o' to distinguish the parameters of the prior
distributions. The prior distribution over the $k$-th component of
the binary vector is $\text{Bernoulli}(z_{ik}^c | \pi_{k_o}^c)$. We draw the atoms
of the dictionary from a multivariate Gaussian base, i.e.
$\varphi_k \sim \mathcal{N}(\varphi_k | \mu_{k_o}, \Lambda_{k_o}^{-1})$, where $\mu_{k_o} \in \mathbb{R}^m$ is the mean vector
and $\Lambda_{k_o} \in \mathbb{R}^{m \times m}$ is the precision matrix for the $k$-th atom of
the dictionary. We model the error as zero mean Gaussian in
$\mathbb{R}^m$. Thus, we arrive at the following representation model:

$$\begin{aligned}
\mathbf{x}_i^c &= \Phi\alpha_i^c + \epsilon_i && \forall i \in \mathcal{I}_c, \ \forall c \\
\alpha_i^c &= \mathbf{z}_i^c \odot \mathbf{s}_i^c \\
z_{ik}^c &\sim \text{Bernoulli}(z_{ik}^c | \pi_{k_o}^c) \\
s_{ik}^c &\sim \mathcal{N}(s_{ik}^c | 0, 1/\lambda_{s_o}^c) \\
\pi_k^c &\sim \text{Beta}(\pi_k^c | a_o/K, \ b_o(K-1)/K) \\
\varphi_k &\sim \mathcal{N}(\varphi_k | \mu_{k_o}, \Lambda_{k_o}^{-1}) && \forall k \in \mathcal{K} \\
\epsilon_i &\sim \mathcal{N}(\epsilon_i | \mathbf{0}, \Lambda_{\epsilon}^{-1}) && \forall i \in \{1, \dots, N\}
\end{aligned} \tag{6}$$

Notice that, in the above model, a conjugate Beta prior is placed
over the parameter of the Bernoulli distribution, as mentioned
in Section III. Hence, a latent probability vector $\pi^c$ (with $\pi_k^c$
as its components) is associated with the dictionary atoms
for the representation of the data from the $c$-th class. The
common dictionary $\Phi$ is inferred from $C$ such vectors. In the
above model, this fact is notationally expressed by showing the
dictionary atoms being sampled from a common set of $|\mathcal{K}|$
distributions, while distinguishing the class-specific variables in
the other notations with a superscript 'c'. We assume the same
statistics for the modeling error over the complete training
data². We further place non-informative Gamma hyper-priors
over the precision parameters of the normal distributions.
That is, $\lambda_{s_o}^c \sim \Gamma(\lambda_{s_o}^c | c_o, d_o)$ and $\lambda_{\epsilon} \sim \Gamma(\lambda_{\epsilon} | e_o, f_o)$, where
$c_o$, $d_o$, $e_o$ and $f_o$ are the parameters of the respective Gamma
distributions. Here, we allow the error to have an isotropic
precision, i.e. $\Lambda_{\epsilon} = \lambda_{\epsilon}\mathbf{I}_m$, where $\mathbf{I}_m$ denotes the identity
matrix in $\mathbb{R}^{m \times m}$. The graphical representation of the complete
model is shown in Fig. 3.

Fig. 3: Graphical representation of the proposed discriminative
Bayesian dictionary learning model.
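To make the generative reading of Eq. (6) concrete, the following sketch draws synthetic data from a finite-$K$ version of the model. It fixes the precisions instead of drawing them from the Gamma hyper-priors and uses a zero-mean, identity-precision prior for the atoms. These simplifications, the function name `sample_model` and the argument values are all assumptions for illustration.

```python
import numpy as np

def sample_model(class_sizes, m, K, a0, b0, lam_s, lam_eps, rng):
    """Draw data from a finite-K version of the model in Eq. (6).

    One dictionary Phi is shared by all classes, while each class c has
    its own selection probabilities pi^c over the K atoms.
    Returns X (m x N), the integer class labels and Phi (m x K).
    """
    Phi = rng.standard_normal((m, K))  # atoms: mu_k = 0, identity precision
    X, labels = [], []
    for c, n_c in enumerate(class_sizes):
        pi_c = rng.beta(a0 / K, b0 * (K - 1) / K, size=K)
        Z = rng.random((K, n_c)) < pi_c[:, None]              # binary support
        S = rng.standard_normal((K, n_c)) / np.sqrt(lam_s)     # weights
        E = rng.standard_normal((m, n_c)) / np.sqrt(lam_eps)   # modeling error
        X.append(Phi @ (Z * S) + E)                            # x = Phi (z . s) + eps
        labels += [c] * n_c
    return np.hstack(X), np.array(labels), Phi

X, labels, Phi = sample_model([5, 7], m=4, K=20, a0=2.0, b0=1.0,
                              lam_s=1.0, lam_eps=100.0,
                              rng=np.random.default_rng(0))
```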

B. Inference 

Gibbs sampling is used to perform Bayesian inference over
the proposed model³. Starting with the dictionary, below we
derive analytical expressions for the posterior distributions
over the model parameters for the Gibbs sampler. The inference
process performs sampling over these posterior distributions.
The expressions are derived assuming a zero mean Gaussian
prior over the dictionary atoms, with isotropic precision.
That is, $\mu_{k_o} = \mathbf{0}$ and $\Lambda_{k_o} = \lambda_{k_o}\mathbf{I}_m$. This simplification leads
to faster sampling, without significantly affecting the accuracy
of the approach. The sampling process samples the atoms
of the dictionary one-by-one from their respective posterior
distributions. This process is analogous to the atom-by-atom
dictionary update step of K-SVD [6]; however, the sparse codes
remain fixed during our dictionary update.

² It is also possible to use different statistics for different classes; however,
in practice the assumption of similar noise statistics works well. We adopt
the latter to avoid unnecessary complexity.

³ Paisley and Carin [56] derived a variational Bayesian algorithm [58] for
their model. It was shown by Zhou et al. (44) that Gibbs sampling is an
equally effective strategy in data representation using the same model. Since
it is easier to relate the Gibbs sampling process to the learning process
of conventional optimization based sparse representation (e.g. K-SVD (6)),
we derive expressions for the Gibbs sampler for our approach. Due to the
conjugacy of the model, these expressions can be derived analytically.

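Sampling $\varphi_k$: For our model, we can write the following
about the posterior distribution over a dictionary atom:

$$p(\varphi_k | -) \propto \prod_{i=1}^{N} \mathcal{N}(\mathbf{x}_i | \Phi(\mathbf{z}_i \odot \mathbf{s}_i), \lambda_{\epsilon_o}^{-1}\mathbf{I}_m) \, \mathcal{N}(\varphi_k | \mathbf{0}, \lambda_{k_o}^{-1}\mathbf{I}_m).$$

Here, we intentionally dropped the superscript 'c' as the
dictionary is updated using the complete training data. Let
$\mathbf{x}_{i\varphi_k}$ denote the contribution of the dictionary atom $\varphi_k$ to the
$i$-th training instance $\mathbf{x}_i$:

$$\mathbf{x}_{i\varphi_k} = \mathbf{x}_i - \Phi(\mathbf{z}_i \odot \mathbf{s}_i) + \varphi_k (z_{ik} \odot s_{ik}). \tag{7}$$

Using Eq. (7), we can re-write the aforementioned proportionality as

$$p(\varphi_k | -) \propto \prod_{i=1}^{N} \mathcal{N}(\mathbf{x}_{i\varphi_k} | \varphi_k (z_{ik} s_{ik}), \lambda_{\epsilon_o}^{-1}\mathbf{I}_m) \, \mathcal{N}(\varphi_k | \mathbf{0}, \lambda_{k_o}^{-1}\mathbf{I}_m).$$

Considering the above expression, the posterior distribution
over a dictionary atom can be written as

$$p(\varphi_k | -) = \mathcal{N}(\varphi_k | \mu_k, \lambda_k^{-1}\mathbf{I}_m), \tag{8}$$

where

$$\mu_k = \frac{\lambda_{\epsilon_o}}{\lambda_k} \sum_{i=1}^{N} (z_{ik} s_{ik}) \, \mathbf{x}_{i\varphi_k}, \qquad \lambda_k = \lambda_{k_o} + \lambda_{\epsilon_o} \sum_{i=1}^{N} (z_{ik} s_{ik})^2.$$

Sampling $z_{ik}^c$: Once the dictionary atoms have been sampled,
we sample $z_{ik}^c$, $\forall i \in \mathcal{I}_c$, $\forall k \in \mathcal{K}$. Using the contribution
of the $k$-th dictionary atom, the posterior probability distribution
over $z_{ik}^c$ can be expressed as

$$p(z_{ik}^c | -) \propto \mathcal{N}(\mathbf{x}_{i\varphi_k}^c | \varphi_k (z_{ik}^c s_{ik}^c), \lambda_{\epsilon_o}^{-1}\mathbf{I}_m) \, \text{Bernoulli}(z_{ik}^c | \pi_{k_o}^c).$$

Here we are concerned with the $c$-th class only; therefore, $\mathbf{x}_{i\varphi_k}^c$
is computed with the $c$-th class data in Eq. (7). With the prior
probability of $z_{ik}^c = 1$ given by $\pi_{k_o}^c$, we can write the following
about its posterior probability:

$$p(z_{ik}^c = 1 | -) \propto \pi_{k_o}^c \exp\left(-\frac{\lambda_{\epsilon_o}}{2} \|\mathbf{x}_{i\varphi_k}^c - \varphi_k s_{ik}^c\|_2^2\right).$$

It can be shown that the right hand side of the above
proportionality can be written as

$$p_1 = \pi_{k_o}^c \zeta_1 \zeta_2,$$

where $\zeta_1 = \exp\left(-\frac{\lambda_{\epsilon_o}}{2}\left(\|\varphi_k\|_2^2 (s_{ik}^c)^2 - 2 s_{ik}^c (\mathbf{x}_{i\varphi_k}^c)^{\mathsf{T}} \varphi_k\right)\right)$ and
$\zeta_2 = \exp\left(-\frac{\lambda_{\epsilon_o}}{2} \|\mathbf{x}_{i\varphi_k}^c\|_2^2\right)$. Furthermore, since the prior
probability of $z_{ik}^c = 0$ is given by $1 - \pi_{k_o}^c$, we can write the
following about its posterior probability:

$$p(z_{ik}^c = 0 | -) \propto (1 - \pi_{k_o}^c) \zeta_2.$$

Thus, $z_{ik}^c$ can be sampled from the following normalized
Bernoulli distribution:

$$\text{Bernoulli}\left(\frac{p_1}{p_1 + (1 - \pi_{k_o}^c)\zeta_2}\right).$$

By inserting the value of $p_1$ and simplifying, we finally arrive
at the following expression for sampling $z_{ik}^c$:

$$z_{ik}^c \sim \text{Bernoulli}\left(\frac{\pi_{k_o}^c \zeta_1}{1 + \pi_{k_o}^c (\zeta_1 - 1)}\right). \tag{9}$$

Sampling $s_{ik}^c$: We can write the following about the posterior
distribution over $s_{ik}^c$:

$$p(s_{ik}^c | -) \propto \mathcal{N}(\mathbf{x}_{i\varphi_k}^c | \varphi_k (z_{ik}^c s_{ik}^c), \lambda_{\epsilon_o}^{-1}\mathbf{I}_m) \, \mathcal{N}(s_{ik}^c | 0, 1/\lambda_{s_o}^c).$$

Again, notice that we are concerned with the $c$-th class data
only. In light of the above expression, $s_{ik}^c$ can be sampled
from the following posterior distribution:

$$p(s_{ik}^c | -) = \mathcal{N}(s_{ik}^c | \mu_s^c, 1/\lambda_s^c), \tag{10}$$

where

$$\mu_s^c = \frac{\lambda_{\epsilon_o}}{\lambda_s^c} z_{ik}^c \, \varphi_k^{\mathsf{T}} \mathbf{x}_{i\varphi_k}^c, \qquad \lambda_s^c = \lambda_{s_o}^c + \lambda_{\epsilon_o} (z_{ik}^c)^2 \|\varphi_k\|_2^2.$$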

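Sampling $\pi_k^c$: Based on our model, we can also write the
posterior probability distribution over $\pi_k^c$ as

$$p(\pi_k^c | -) \propto \prod_{i \in \mathcal{I}_c} \text{Bernoulli}(z_{ik}^c | \pi_{k_o}^c) \, \text{Beta}\left(\pi_{k_o}^c \,\Big|\, \frac{a_o}{K}, \frac{b_o(K-1)}{K}\right).$$

Using the conjugacy between the distributions, it can be easily
shown that the $k$-th component of $\pi^c$ must be drawn from the
following posterior distribution during the sampling process:

$$p(\pi_k^c | -) = \text{Beta}\left(\pi_k^c \,\Big|\, \frac{a_o}{K} + \sum_{i \in \mathcal{I}_c} z_{ik}^c, \ \frac{b_o(K-1)}{K} + |\mathcal{I}_c| - \sum_{i \in \mathcal{I}_c} z_{ik}^c\right). \tag{11}$$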
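The conjugate update of Eq. (11) only needs the per-atom usage counts of one class. A minimal sketch follows, with the hypothetical name `sample_pi` and toy inputs chosen for illustration.

```python
import numpy as np

def sample_pi(Z_c, a0, b0, K, rng):
    """Draw pi^c from the Beta posterior of Eq. (11).

    Z_c: binary support matrix of class c, one row per atom and one
    column per training instance of the class. K is the constant that
    appears in the prior Beta(a0 / K, b0 * (K - 1) / K).
    """
    n_c = Z_c.shape[1]          # |I_c|
    counts = Z_c.sum(axis=1)    # number of class-c instances using each atom
    return rng.beta(a0 / K + counts, b0 * (K - 1) / K + n_c - counts)

# Atom 0 is used by every instance of the class, atom 1 by none.
Z_c = np.vstack([np.ones(50), np.zeros(50)])
pi_c = sample_pi(Z_c, a0=1.0, b0=1.0, K=2, rng=np.random.default_rng(1))
```

An atom that is used by every instance of the class receives a probability close to 1, and an unused atom a probability close to 0.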



Sampling $\lambda_s^c$: In our model, the components of the weight
vectors are drawn from a zero-mean normal distribution. For
a given weight vector, common priors are assumed over the
precision parameters of these distributions. This allows us to
express the likelihood function for $\lambda_s^c$ in terms of a
multivariate Gaussian with isotropic precision. Thus, we can
write the posterior over $\lambda_s^c$ as the following:

$$p(\lambda_s^c | -) \propto \prod_{i \in \mathcal{I}_c} \mathcal{N}\left(\mathbf{s}_i^c \,\Big|\, \mathbf{0}, \frac{1}{\lambda_{s_o}^c}\mathbf{I}_{|\mathcal{K}|}\right) \Gamma(\lambda_{s_o}^c | c_o, d_o).$$

Using the conjugacy between the Gaussian and Gamma distributions,
it can be shown that $\lambda_s^c$ must be sampled as follows:

$$\lambda_s^c \sim \Gamma\left(\frac{|\mathcal{K}| N_c}{2} + c_o, \ \frac{1}{2}\sum_{i \in \mathcal{I}_c} \|\mathbf{s}_i^c\|_2^2 + d_o\right). \tag{12}$$


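Sampling $\lambda_{\epsilon}$: We can write the posterior over $\lambda_{\epsilon}$ as

$$p(\lambda_{\epsilon} | -) \propto \prod_{i=1}^{N} \mathcal{N}(\mathbf{x}_i | \Phi(\mathbf{z}_i \odot \mathbf{s}_i), \lambda_{\epsilon_o}^{-1}\mathbf{I}_m) \, \Gamma(\lambda_{\epsilon_o} | e_o, f_o).$$

Similar to $\lambda_s^c$, we can arrive at the following for sampling $\lambda_{\epsilon}$
during the inference process:

$$\lambda_{\epsilon} \sim \Gamma\left(\frac{mN}{2} + e_o, \ \frac{1}{2}\sum_{i=1}^{N} \|\mathbf{x}_i - \Phi(\mathbf{z}_i \odot \mathbf{s}_i)\|_2^2 + f_o\right). \tag{13}$$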
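Both precision updates, Eqs. (12) and (13), are Gamma draws. The sketch below assumes that the second argument of $\Gamma(\cdot, \cdot)$ is a rate parameter. NumPy's sampler takes a scale instead, so the reciprocal is passed. The function names and hyper-parameter values are hypothetical.

```python
import numpy as np

def sample_lam_s(S_c, c0, d0, rng):
    """Eq. (12): precision of the weights of one class.

    S_c: (K, N_c) weight matrix of class c.
    """
    K, n_c = S_c.shape
    shape = 0.5 * K * n_c + c0
    rate = 0.5 * np.sum(S_c ** 2) + d0
    return rng.gamma(shape, 1.0 / rate)   # NumPy expects scale = 1 / rate

def sample_lam_eps(X, Phi, Z, S, e0, f0, rng):
    """Eq. (13): precision of the modeling error over all training data."""
    m, N = X.shape
    R = X - Phi @ (Z * S)                 # reconstruction residual
    shape = 0.5 * m * N + e0
    rate = 0.5 * np.sum(R ** 2) + f0
    return rng.gamma(shape, 1.0 / rate)
```

With many observations, each draw concentrates around the inverse of the empirical variance of the weights or of the residual, respectively.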


As a result of Bayesian inference over the model, we
obtain sets of posterior distributions over the model parameters.
We are particularly interested in two of them, namely
the set of distributions over the dictionary atoms
$\aleph = \{\mathcal{N}(\varphi_k | \mu_k, \Lambda_k^{-1}) : k \in \mathcal{K}\} \subset \mathbb{R}^m$, and the set of probability
distributions characterized by the vectors $\pi^{c \in \{1, \dots, C\}} \in \mathbb{R}^{|\mathcal{K}|}$.
Momentarily, we defer the discussion on the latter. The former
is used to compute the desired dictionary $\Phi$. This is done
by drawing multiple samples from the elements of $\aleph$ and
estimating the corresponding dictionary atoms as respective
means of the samples. Indeed, the mean parameters of the
elements of $\aleph$ can also be chosen as the desired dictionary
atoms. However, we prefer the former approach for robustness.
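The sample-mean estimate of the dictionary described above can be sketched as follows. The sketch assumes isotropic posterior precisions for the atoms, as in the inference equations. The function name `estimate_dictionary`, the number of samples and the test values are hypothetical.

```python
import numpy as np

def estimate_dictionary(mu, lam, num_samples, rng):
    """Estimate the dictionary as the mean of posterior samples.

    mu:  (m, K) posterior means of the atoms.
    lam: (K,) isotropic posterior precisions of the atoms (assumption).
    """
    m, K = mu.shape
    noise = rng.standard_normal((num_samples, m, K)) / np.sqrt(lam)[None, None, :]
    draws = mu[None, :, :] + noise   # num_samples draws of the full dictionary
    return draws.mean(axis=0)

mu = np.arange(6.0).reshape(3, 2)
Phi_hat = estimate_dictionary(mu, lam=np.array([1e6, 1e6]), num_samples=50,
                              rng=np.random.default_rng(0))
```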

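The proposed model and the sampling process also result in
inferring the correct size of the desired dictionary. We present
the following Lemmas in this regard:

Lemma 4.1: For $K \to \infty$, $\mathbb{E}[\xi] = \frac{a_o}{b_o}$, $\forall c$, where $\xi = \sum_{k=1}^{K} z_{ik}^c$.

Proof⁴: According to the proposed model, the covariance of
a data vector from the $c$-th class can be given by:

$$\mathbb{E}[(\mathbf{x}_i^c)(\mathbf{x}_i^c)^{\mathsf{T}}] = \frac{a_o K}{a_o + b_o(K-1)} \frac{\Lambda_{k_o}^{-1}}{\lambda_{s_o}^c} + \Lambda_{\epsilon_o}^{-1}. \tag{14}$$

In Eq. (14), the fraction $\frac{a_o}{a_o + b_o(K-1)}$ appears due to the presence of
$\mathbf{z}_i^c$ in the model and the equation simplifies to
$\mathbb{E}[(\mathbf{x}_i^c)(\mathbf{x}_i^c)^{\mathsf{T}}] = K\frac{\Lambda_{k_o}^{-1}}{\lambda_{s_o}^c} + \Lambda_{\epsilon_o}^{-1}$
when we neglect $\mathbf{z}_i^c$. Here, $K$ signifies the
number of dictionary atoms required to represent the data
vector. Notice in the equation that, as $K \to \infty$, we observe
$\mathbb{E}[(\mathbf{x}_i^c)(\mathbf{x}_i^c)^{\mathsf{T}}] \to \frac{a_o}{b_o}\frac{\Lambda_{k_o}^{-1}}{\lambda_{s_o}^c} + \Lambda_{\epsilon_o}^{-1}$.
Thus, in the limit $K \to \infty$, $\frac{a_o}{b_o}$
corresponds to the expected number of non-zero components
in $\mathbf{z}_i^c$, given by $\mathbb{E}[\xi]$, where $\xi = \sum_{k=1}^{K} z_{ik}^c$.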
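The limit in Lemma 4.1 can be checked numerically. Each $\pi_k^c$ has prior mean $a_o/(a_o + b_o(K-1))$, so the expected number of active atoms among $K$ is $K a_o/(a_o + b_o(K-1))$. The function name below is hypothetical.

```python
def expected_active_atoms(a0, b0, K):
    """Expected number of non-zero entries of z_i^c under the prior.

    Each of the K atoms is selected with probability
    E[pi_k] = a0 / (a0 + b0 * (K - 1)), the mean of Beta(a0/K, b0(K-1)/K).
    """
    return K * a0 / (a0 + b0 * (K - 1))

# The value approaches a0 / b0 = 2 as K grows.
values = [expected_active_atoms(2.0, 1.0, K) for K in (1, 10, 1000, 10**9)]
```

This is why $a_o$ and $b_o$ can be used to control the sparsity of the representation when $K$ is large.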

Lemma 4.2: Once $\pi_{k_o}^c = 0$ in a given iteration of the
sampling process, $\mathbb{E}[\pi_k^c] \approx 0$ for the later iterations.

Proof: According to Eq. (9), $\forall i \in \mathcal{I}_c$, $z_{ik}^c = 0$ when $\pi_{k_o}^c = 0$.
Once this happens, the posterior distribution over $\pi_k^c$ becomes
$\text{Beta}(\pi_k^c | \hat{a}, \hat{b} + |\mathcal{I}_c|)$, where $\hat{a} = \frac{a_o}{K}$ and $\hat{b} = \frac{b_o(K-1)}{K}$ (see
Eq. (11)). Thus, the expected value of $\pi_k^c$ for the later iterations
can be written as
$\mathbb{E}[\pi_k^c] = \frac{\hat{a}}{\hat{a} + \hat{b} + |\mathcal{I}_c|} = \frac{a_o}{a_o + b_o(K-1) + K|\mathcal{I}_c|}$. With
$0 < a_o, b_o < |\mathcal{I}_c| \ll K$ we can see that $\mathbb{E}[\pi_k^c] \approx 0$.

⁴ We follow [56] closely in the proof; however, our analysis also takes into
account the class labels of the data, whereas no such data discrimination is
assumed in [56].

In the Gibbs sampling process, we start with a very large $K$ (approximating $K \to \infty$) in our implementation and let $0 < a_o, b_o < |\mathcal{I}_c|$. Considering Lemma 4.1, the values of $a_o$ and $b_o$ are set to ensure that the resulting representation is sparse. We drop the $k$-th dictionary atom during the sampling process if $\pi_k^c = 0$ for all the classes simultaneously. According to Lemma 4.2, dropping such an atom does not bring significant changes to the final representation. Thus, by removing the redundant dictionary atoms in each sampling iteration, we finally arrive at the correct size of the desired dictionary, i.e. $|\mathcal{K}|$.
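The pruning step described above can be sketched as follows. The array layout (one row of selection probabilities per class, one coefficient matrix per class) and both function names are assumptions made for illustration. The second helper evaluates the posterior mean used in the proof of Lemma 4.2, to show how small it is for typical sizes.

```python
import numpy as np

def prune_unused_atoms(Phi, Pi, Z_by_class, S_by_class):
    """Drop atoms whose selection probability is zero for every class.

    Phi: (m, K) dictionary, Pi: (C, K) selection probabilities,
    Z_by_class / S_by_class: lists of (K, N_c) arrays, one per class.
    """
    keep = np.any(Pi > 0, axis=0)           # atom survives if any class uses it
    return (Phi[:, keep], Pi[:, keep],
            [Z[keep] for Z in Z_by_class],
            [S[keep] for S in S_by_class],
            keep)

def expected_pi_after_zero(a0, b0, K, n_c):
    """E[pi_k^c] once all z_ik^c are zero: a0 / (a0 + b0*(K-1) + K*n_c)."""
    return a0 / (a0 + b0 * (K - 1) + K * n_c)
```

For example, with `a0 = b0 = 2`, `K = 1000` and 20 training samples in a class, the posterior mean is below `1e-4`, which is why a dropped atom is unlikely to be selected again.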

As mentioned above, with Bayesian inference over the proposed model we also infer a set of probability vectors $\pi^{c \in \{1,\dots,C\}}$. Each element of this set, i.e. $\pi^c \in \mathbb{R}^{|\mathcal{K}|}$, further characterizes a set of probability distributions $\beth^c = \{\text{Bernoulli}(\pi_k^c) : k \in \mathcal{K}\} \subset \mathbb{R}$. Here, $\text{Bernoulli}(\pi_k^c)$ is jointly followed by all the $k$-th components of the sparse codes for the $c$-th class. If the $k$-th dictionary atom is commonly used in representing the $c$-th class training data, we must expect a high value of $\pi_k^c$, and $\pi_k^c \to 0$ otherwise. In other words, for an arranged dictionary, components of $\pi^c$ having large values should generally cluster well if the learned dictionary is discriminative. Furthermore, these clusters must appear at different locations in the inferred vectors for different classes. Such clusterings would demonstrate the discriminative character of the inferred dictionary. Fig. 4 verifies this character for the dictionaries inferred under the proposed model. Each row of the figure plots six different probability vectors (i.e. $\pi^c$) for different training datasets. A clear clustering of the high value components of the vectors is visible in each plot. Detailed experiments are presented in Section V.


C. Classification 

Let $\mathbf{y} \in \mathbb{R}^m$ be a query signal. We follow the common methodology [7], [9] for classification that first encodes $\mathbf{y}$ over the inferred dictionary such that $\mathbf{y} = \Phi\hat{\alpha} + \epsilon$, and then computes $\ell = \mathbf{W}\hat{\alpha}$, where $\mathbf{W} \in \mathbb{R}^{C \times |\mathcal{K}|}$ contains model parameters of a multi-class linear classifier. The query is assigned the class label corresponding to the largest component of $\ell \in \mathbb{R}^C$. The main difference between the classification approach of this work and that of the existing techniques is in the learning process of $\mathbf{W}$. Whereas discrimination is induced in $\Phi$ by the joint optimization of $\mathbf{W}$ and $\Phi$ in the existing techniques (see Eq. (3)), this is already achieved in the inference process of the proposed approach. Thus, it is possible to optimize a classifier separately from the dictionary learning process without affecting the discriminative abilities of the learned dictionary.
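The two-step classification rule (sparse coding followed by a linear classifier) can be sketched as below. This is a minimal, unoptimized Orthogonal Matching Pursuit written for illustration; it assumes dictionary columns of roughly unit norm, and the function names and the stopping tolerance are choices made here rather than details taken from the original implementation.

```python
import numpy as np

def omp(Phi, y, sparsity):
    """Greedy OMP: approximate y by Phi @ alpha with at most `sparsity` non-zeros."""
    residual = np.asarray(y, dtype=float).copy()
    support = []
    coef = np.zeros(0)
    alpha = np.zeros(Phi.shape[1])
    for _ in range(sparsity):
        correlations = np.abs(Phi.T @ residual)
        if support:
            correlations[support] = 0.0     # never reselect an atom
        k = int(np.argmax(correlations))
        if correlations[k] < 1e-12:         # nothing left to explain
            break
        support.append(k)
        # Least-squares refit of y on all atoms selected so far.
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    if support:
        alpha[support] = coef
    return alpha

def classify(Phi, W, y, sparsity=50):
    """Encode y over Phi, then return argmax of l = W @ alpha and the scores."""
    alpha_hat = omp(Phi, y, sparsity)
    scores = W @ alpha_hat
    return int(np.argmax(scores)), scores
```

A toy usage: with `Phi = np.eye(6)`, a classifier `W` whose first row sums atoms 0 to 2 and whose second row sums atoms 3 to 5, and a query supported on atoms 4 and 5, `classify` returns label 1.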

Let $\mathbf{h}_i^c \in \mathbb{R}^C$ be a binary vector with the only 1 appearing at the $c$-th index, indicating the class of the training instance $\mathbf{x}_i^c$. Let $\mathbf{H} \in \mathbb{R}^{C \times N}$ be the binary index matrix formed by such vectors for the complete training data $\mathbf{X}$. We aim at computing $\mathbf{W}$ such that $\mathbf{H} = \mathbf{W}\mathbf{B} + \mathbf{E}$, where $\mathbf{E} \in \mathbb{R}^{C \times N}$ denotes the modeling error and $\mathbf{B} \in \mathbb{R}^{|\mathcal{K}| \times N}$ is the coefficient matrix. Notice that we can directly use the model proposed above for dictionary learning to compute $\mathbf{W}$. For that, we can write $\mathbf{h}_i^c = \mathbf{W}\beta_i^c + \epsilon_i$, where $\beta_i^c \in \mathbb{R}^{|\mathcal{K}|}$ is a column of $\mathbf{B}$. Thus, we infer $\mathbf{W}$ under the Bayesian framework using the same model. While learning this matrix, we perform Gibbs sampling such that the probability vectors $\pi^{c \in \{1,\dots,C\}}$ are kept constant to those finally inferred by the dictionary learning stage. That is, wherever required, the value of $\pi_k^c$ is directly used from $\pi^{c \in \{1,\dots,C\}}$ instead of inferring a new value during the sampling process.

Fig. 4: Illustration of the discriminative character of the inferred dictionary: From top, the four rows present results on AR database [1], Extended YaleB [2], Caltech-101 [3] and Fifteen Scene Categories [4], respectively. In each plot, the x-axis represents $k \in \mathcal{K}$ and the y-axis shows the corresponding probability of selection of the $k$-th dictionary atom in the expansion of the data. A plot represents a single $\pi^c$ vector learned as a result of Bayesian inference. For the first three rows, from left to right, the value of $c$ (i.e. class label) is 1, 5, 10, 15, 20 and 25, respectively. For the fourth row the value of $c$ is 1, 3, 5, 7, 9 and 11 for the plots from left to right. Plots clearly show distinct clusters of high probabilities for different classes.

The reason for using the same $\pi^{c \in \{1,\dots,C\}}$ vectors for inferring $\mathbf{W}$ and $\Phi$ is straightforward. Since we first sparse code the query over the learned discriminative dictionary, we expect the underlying support of the learned codes to follow some $\pi^c$ closely. Thus, $\mathbf{W}$ can be expected to classify the learned codes better if the discriminative information regarding their support is encoded in it. Notice that, unlike the existing approaches (e.g. [7], [9]), the coupling between $\mathbf{W}$ and $\Phi$ is kept probabilistic in our approach. We do not assume that the 'exact values' of the sparse codes of the query would match those of the training samples (and hence that $\mathbf{W}$ and $\Phi$ should be trained jointly). Rather, our assumption is that samples from the same class are more likely explainable using similar bases. Therefore, the coupling between $\mathbf{W}$ and $\Phi$ is kept in terms of probabilistic selection of their columns. Our viewpoint also makes Orthogonal Matching Pursuit (OMP) [60] a natural choice for sparse coding the query over the dictionary. This greedy pursuit algorithm efficiently searches for the right basis to represent the data. Therefore, we used OMP in sparse coding the query over the learned dictionary.


D. Initialization 

For inferring the dictionary, we need to first initialize $\mathbf{z}_i^c$, $\mathbf{s}_i^c$ and $\pi_k^c$. We initialize $\Phi$ by randomly selecting the training instances with replacement. We sparsely encode $\mathbf{x}_i^c$ over the initial dictionary using OMP [60]. The sparse codes are considered as the initial $\mathbf{s}_i^c$, whereas their support forms the initial vector $\mathbf{z}_i^c$. Computing the initial $\mathbf{s}_i^c$ and $\mathbf{z}_i^c$ with other methods, such as regularized least squares, is equally effective. We set $\pi_k^c = 0.5$, $\forall c, \forall k$ for the initialization. Notice that this means that all the dictionary atoms initially have equal chances of getting selected in the expansion of a training instance from any class. The values of $\pi_k^c$, $\forall c, \forall k$ finally inferred by the dictionary learning process serve as the initial values of these parameters for learning the classifier. Similarly, the vectors $\mathbf{z}_i^c$ and $\mathbf{s}_i^c$ computed by the dictionary learning stage are used for initializing the corresponding vectors for the classifier. We initialize $\mathbf{W}$ using the ridge regression model [61] with the $\ell_2$-norm regularizer and quadratic loss:

$$\mathbf{W} = \arg\min_{\mathbf{W}} \|\mathbf{h}_i - \mathbf{W}\alpha_i\|_2^2 + \lambda\|\mathbf{W}\|_2^2, \quad \forall i \in \{1, \dots, N\}, \qquad (15)$$

where $\lambda$ is the regularization constant. The computation is done over the complete training data, therefore the superscript 'c' is dropped in the above equation.
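A ridge-regression initialization of this kind has a well-known closed form when the quadratic loss is summed over all training samples, $\mathbf{W} = \mathbf{H}\mathbf{A}^T(\mathbf{A}\mathbf{A}^T + \lambda\mathbf{I})^{-1}$, where the columns of $\mathbf{A}$ are the sparse codes. The sketch below assumes that summed (Frobenius-norm) objective; the function name and the default `lam` are illustrative.

```python
import numpy as np

def init_classifier(H, A, lam=1.0):
    """Ridge-regression initialization W = H A^T (A A^T + lam I)^(-1).

    H: (C, N) binary label matrix, A: (K, N) sparse codes of the training data.
    """
    K = A.shape[0]
    gram = A @ A.T + lam * np.eye(K)        # symmetric positive definite
    # Solve W @ gram = H @ A.T without forming an explicit inverse.
    return np.linalg.solve(gram, A @ H.T).T
```

Solving the linear system instead of inverting the Gram matrix is the usual numerically safer choice; the result is a $C \times K$ matrix that can serve as the starting point for the sampler.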

V. Experiments 

We have evaluated the proposed approach on two face data sets: the Extended YaleB [2] and the AR database [1], a data set for object categories: Caltech-101 [3], a data set for scene categorization: Fifteen Scene Categories [4], and an action data set: UCF Sports Actions [5]. These data sets are commonly used in the literature for evaluation of sparse representation based classification techniques. We compare the performance of the proposed approach with SRC [8], the two variants of Label-Consistent K-SVD [9] (i.e. LC-KSVD1, LC-KSVD2), the Discriminative K-SVD algorithm (D-KSVD) [7], the Fisher Discrimination Dictionary Learning algorithm (FDDL) [10] and the Dictionary Learning based on separating the Commonalities and the Particularities of the data (DL-COPAR) [11]. In our comparisons, we also include results of unsupervised sparse representation based classification that uses K-SVD [6] as the dictionary learning technique and separately computes a multi-class linear classifier using Eq. (15).

For all of the above mentioned methods, except SRC and D-KSVD, we acquired the public codes from the original authors. To implement SRC, we used the LASSO [63] solver of the SPAMS toolbox [62]. For D-KSVD, we used the public code provided by Jiang et al. [9] for the LC-KSVD2 algorithm and solved Eq. (3) with $\nu = 0$. The experiments are performed on an Intel Core i7-2600 CPU at 3.4 GHz with 8 GB RAM. We ran the above mentioned methods and the proposed approach ourselves on the same data. The parameters of the existing approaches were carefully optimized following the guidelines of the original works. We report the parameter values used and, where it exists, we note the difference between our values and those used in the original works. In our experiments, these differences were made to favor the existing approaches. Results of the approaches other than those mentioned above are taken directly from the literature, where the same experimental protocol has been followed.

For the proposed approach, the parameter values used were as follows. In all experiments, we chose $K = 1.5N$ for initialization, whereas $c_o$, $d_o$, $e_o$ and $f_o$ were all set to $10^{-6}$. We selected $a_o = b_o = \min_c |\mathcal{I}_c|$, whereas $\lambda_{s_o}$ and $\lambda_{k_o}$ were set to 1 and $m$, respectively. Furthermore, $\lambda_{\epsilon_o}$ was set to $10^6$ for all the datasets except for Fifteen Scene Categories [4], where we used $\lambda_{\epsilon_o} = 10^9$. In each experiment, the Bayesian inference was performed with 35 Gibbs sampling iterations. We defer further discussion on the selection of the parameter values to Section VI.


A. Extended YaleB 

Extended YaleB [2] contains 2,414 frontal face images of 38 different people, each having about 64 samples. The images are acquired under varying illumination conditions and the subjects have different facial expressions. This makes the database fairly challenging, see Fig. 5a for examples. In our experiments, we used the random face feature descriptor [2], where a cropped 192 × 168 pixel image was projected onto a 504-dimensional vector. For this, the projection matrix was generated from random samples of standard normal distributions. Following the common settings for this database, we chose one half of the images for training and the remaining samples were used for testing. We performed ten experiments by randomly selecting the samples for training and testing.
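The random-face projection described above amounts to multiplying each vectorized image by a matrix of standard normal entries. A minimal sketch follows; the function name, the fixed seed and the column-per-sample output layout are assumptions, and any normalization of the projection matrix or of the features is left out because the text does not specify it.

```python
import numpy as np

def random_face_features(images, out_dim=504, seed=0):
    """Project images onto `out_dim` random directions.

    images: (n, h, w) array of cropped face images.
    Returns an (out_dim, n) matrix with one feature vector per column.
    """
    rng = np.random.default_rng(seed)
    n = images.shape[0]
    flat = images.reshape(n, -1).astype(float)          # (n, h * w)
    P = rng.standard_normal((out_dim, flat.shape[1]))   # projection matrix
    return P @ flat.T
```

For 192 × 168 images the projection matrix has 504 × 32,256 entries, so in practice it would be generated once and reused for both the training and the test images.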


Fig. 5: Examples from the face databases: (a) Extended YaleB [2]; (b) AR database [1].


Based on these experiments, the mean recognition accuracies of different approaches are reported in Table I. The result for Locality-constrained Linear Coding (LLC) [15] is taken directly from [9], where the accuracy is computed using 70 local bases.

Similar to Jiang et al. [9], the sparsity threshold for K-SVD, LC-KSVD1, LC-KSVD2 and D-KSVD was set to 30 in our experiments. Larger values of this parameter were found to be ineffective as they mainly resulted in slowing the algorithms without improving the recognition accuracy. Furthermore, as in [9], we used $\nu = 4.0$ for LC-KSVD1 and LC-KSVD2, whereas $\kappa$ was set to 2.0 for LC-KSVD2 and D-KSVD in Eq. (3). Keeping these parameter values fixed, we optimized for the number of dictionary atoms for each algorithm. This resulted in selecting 600 atoms for LC-KSVD2, D-KSVD and K-SVD, whereas 500 atoms consistently resulted in the best performance of LC-KSVD1. This value is set to 570 in [9] for all of the four methods. In all techniques that learn dictionaries, we used the complete training data in the learning process. Therefore, all training samples were used as dictionary atoms for SRC. Following [8], we set the residual error tolerance to 0.05 for SRC. Smaller values of this parameter also resulted in very similar accuracies. For FDDL, we followed [10] for the optimized parameter settings. These settings are the same as those reported for the AR database in the original work. We refer the reader to the original work for the list of the parameters and their exact values. The results reported in the table are obtained by the Global Classifier (GC) of FDDL, which showed better performance than the Local Classifier (LC). For the parameter settings of DL-COPAR we followed the original work [11]. We fixed 15 atoms for each class and a set of 5 atoms was chosen to learn commonalities of the classes. The reported results are achieved by LC, which performed better than GC in our experiments.

It is clear from Table I that our approach outperforms the above mentioned approaches in terms of recognition accuracy, with nearly 23% improvement over the error rate of the second best approach. Furthermore, the time required by the proposed approach for classifying a single test instance is also very low as compared to SRC, FDDL and DL-COPAR. For the proposed approach, this time (1.23 ms) is of the same order as that of D-KSVD and LC-KSVD (about 0.4 ms). Like these algorithms, the computational efficiency in the classification stage of our approach comes from using the learned multi-class linear classifier to classify the sparse codes of a test instance.

TABLE I: Recognition accuracy with Random-Face features on the Extended YaleB database [2]. The computed average time is for classification of a single instance.

Method            Accuracy %       Average Time (ms)
LLC [15]          90.7             -
K-SVD [6]         93.13 ± 0.43     0.37
LC-KSVD1 [9]      93.59 ± 0.54     0.36
D-KSVD [7]        94.79 ± 0.49     0.38
DL-COPAR [11]     94.83 ± 0.52     32.55
LC-KSVD2 [9]      95.22 ± 0.61     0.39
FDDL [10]         96.07 ± 0.64     49.59
SRC [8]           96.32 ± 0.85     53.12
Proposed          97.19 ± 0.71     1.23


TABLE II: Recognition accuracy with Random-Face features on the AR database [1]. The computed time is for classifying a single instance. The ‡ sign denotes the results using the parameter settings reported in the original works.

Method            Accuracy %       Average Time (ms)
LLC [15]          88.7             -
DL-COPAR [11]     93.23 ± 1.71     39.80
LC-KSVD1 [9]      93.48 ± 1.13     0.98
LC-KSVD1‡         87.48 ± 1.19     0.37
K-SVD [6]         94.13 ± 1.20     0.99
LC-KSVD2 [9]      95.33 ± 1.24     1.01
LC-KSVD2‡         88.35 ± 1.33     0.41
D-KSVD [7]        95.47 ± 1.50     1.01
D-KSVD‡           88.29 ± 1.38     0.38
FDDL [10]         96.22 ± 1.03     50.03
SRC [8]           96.65 ± 1.37     62.86
Proposed          97.41 ± 1.04     1.27



B. AR Database 

This database contains more than 4,000 color face images of 126 people. There are 26 images per person taken during two different sessions. In comparison to Extended YaleB, the images in the AR database have larger variations in terms of facial expressions, disguise and illumination conditions. Samples from the AR database are shown in Fig. 5b for illustration. We followed a common evaluation protocol in our experiments for this database, in which we used a subset of 2600 images pertaining to 50 male and 50 female subjects. For each subject, we randomly chose 20 samples for training and the rest for testing. The 165 × 120 pixel images were projected onto a 540-dimensional vector with the help of a random projection matrix, as in Section V-A. We report the average recognition accuracy of our experiments in Table II, which also includes the accuracy of LLC [15] reported in [9]. The mean values reported in the table are based on ten experiments.

In our experiments, we set the sparsity threshold for K-SVD, LC-KSVD1, LC-KSVD2 and D-KSVD to 50, as compared to 10 and 30, which were used in [7] and [9], respectively. Furthermore, the dictionary size for K-SVD, LC-KSVD2 and D-KSVD was set to 1500 atoms, whereas the dictionary size for LC-KSVD1 was set to 750. These large values (compared to 500 used in [7], [9]) resulted in better accuracies at the expense of more computation. However, the classification time per test instance remained reasonably small. In Table II we also include the results of LC-KSVD1, LC-KSVD2 and D-KSVD using the parameter values proposed in the original works. These results are distinguished with the ‡ sign. For FDDL and DL-COPAR we used the same parameter settings as in Section V-A. The reported results are for GC and LC for FDDL and DL-COPAR, respectively. For SRC we set the residual error tolerance to $10^{-6}$. This small value gave the best results.

From Table II we can see that the proposed approach performs better than the existing approaches in terms of accuracy. The recognition accuracies of SRC and FDDL are fairly close to our approach; however, these algorithms require a large amount of time for classification. This fact compromises their practicality. In contrast, the proposed approach shows high recognition accuracy (i.e. 22% reduction in the error rate as compared to SRC) with less than 1.5 ms required for classifying a test instance. The relative difference between the classification time of the proposed approach and the existing approaches remains similar in the experiments below. Therefore, we do not explicitly note these timings for all of the approaches in these experiments.

Fig. 6: Examples from Caltech-101 database [3], including (a) Minaret and (c) Stop sign. The proposed approach achieves 100% accuracy on these classes.
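The error-rate reductions quoted for the two face databases follow directly from the mean accuracies of SRC and the proposed approach in Tables I and II. The short helper below (its name is chosen here for illustration) makes the arithmetic explicit.

```python
def error_rate_reduction(acc_baseline, acc_proposed):
    """Relative reduction (in %) of the error rate; accuracies are given in %."""
    err_base = 100.0 - acc_baseline
    err_prop = 100.0 - acc_proposed
    return 100.0 * (err_base - err_prop) / err_base

# Extended YaleB, proposed approach versus SRC (Table I)
print(round(error_rate_reduction(96.32, 97.19), 1))  # 23.6
# AR database, proposed approach versus SRC (Table II)
print(round(error_rate_reduction(96.65, 97.41), 1))  # 22.7
```

That is, the error rate drops from 3.68% to 2.81% on Extended YaleB and from 3.35% to 2.59% on the AR database.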

C. Caltech-101 

The Caltech-101 database [3] comprises 9,144 samples from 102 classes. Among these, there are 101 object classes (e.g. minarets, trees, signs) and one "background" class. The number of samples per class varies from 31 to 800, and the images within a given class have significant shape variations, as can be seen in Fig. 6. To use the database, first the SIFT descriptors [64] were extracted from 16 × 16 image patches, which were densely sampled with a 6-pixel step size for the grid. Then, based on the extracted features, spatial pyramid features [38] were extracted with $2^l \times 2^l$ grids, where $l = 0, 1, 2$. The codebook for the spatial pyramid was trained using k-means with $k = 1024$. Then, the dimension of a spatial pyramid feature was reduced to 3000 using PCA. Following the common experimental protocol, we selected 5, 10, 15, 20, 25 and 30 instances per class for training the dictionary and the remaining instances were used in testing, in our six different experiments. Each experiment was repeated ten times with random selection of train and test data. The mean accuracies of these experiments are reported in Table III.

TABLE III: Classification results using Spatial Pyramid Features on the Caltech-101 dataset [3].

Training samples per class    5       10      15      20      25      30
Zhang et al. [37]             46.6    55.8    59.1    62.0    -       66.20
Lazebnik et al. [38]          -       -       56.4    -       -       64.6
Griffin et al. [39]           44.2    54.5    59.0    63.3    65.8    67.6
Wang et al. [15]              51.1    59.8    65.4    67.7    70.2    73.4
SRC [8]                       49.9    60.1    65.0    67.5    69.3    70.9
DL-COPAR [11]                 49.7    58.9    65.2    69.1    71.0    72.9
K-SVD [6]                     51.2    59.1    64.9    68.7    71.0    72.3
FDDL [10]                     52.1    59.8    66.2    68.9    71.3    73.1
D-KSVD [7]                    52.1    60.8    66.1    69.6    70.8    73.1
LC-KSVD1 [9]                  53.1    61.2    66.3    69.8    71.9    73.5
LC-KSVD2 [9]                  53.8    62.8    67.3    70.4    72.6    73.9
Proposed                      53.9    63.1    67.7    70.9    73.2    74.6


For this dataset, we set the number of dictionary atoms used by K-SVD, LC-KSVD1, LC-KSVD2 and D-KSVD to the number of training examples available. This resulted in the best performance of these algorithms. The sparsity level was also set to 50, and $\nu$ and $\kappa$ were set to 0.001. Jiang et al. [9] also suggested the same parameter settings. For SRC, the error tolerance of $10^{-6}$ gave the best results in our experiments. We used the parameter settings for object categorization given in [10] for FDDL. For DL-COPAR, the selected number of class-specific atoms was kept the same as the number of training instances per class, whereas the number of shared atoms was fixed to 314, as in the original work [11]. For this database GC performed better than LC for DL-COPAR in our experiments.


From Table III it is clear that the proposed approach consistently outperforms the competing approaches. For some cases the accuracy of LC-KSVD2 is very close to the proposed approach (e.g. 53.8% versus 53.9% with 5 training samples); however, with the increasing number of training instances the difference between the results increases in favor of the proposed approach (73.9% versus 74.6% with 30 training samples). This is an expected phenomenon since more training samples result in more precise posterior distributions in Bayesian settings. Here, it is also worth mentioning that being Bayesian, the proposed approach is inherently an online technique. This means that, in our approach, the computed posterior distributions can be used as prior distributions for further inference if more training data becomes available. Moreover, our approach is able to handle a large batch of training data more efficiently than LC-KSVD [9] and D-KSVD [7]. This can be verified by comparing the training time of the approaches in Table IV. The timings are given for complete training and testing durations for the Caltech-101 database, where we used a batch of 30 images per class for training and the remaining images were used for testing. We note that, as for all the other approaches, good initialization (using the procedure presented in Section IV-D) also contributes towards the computational efficiency of our approach. The training time in the table also includes the initialization time for all the approaches. Note that the testing time of the proposed approach is very similar to those of the other approaches in Table IV.
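The online property mentioned above, i.e. reusing a posterior as the prior for the next batch, can be illustrated with a toy conjugate example. The sketch below uses only a single Beta-Bernoulli pair (the kind of pair that governs an atom-selection probability), not the full hierarchical model; the function name and the numbers are illustrative assumptions.

```python
import numpy as np

def update_beta(a, b, z):
    """Conjugate update of a Beta(a, b) prior with Bernoulli observations z."""
    z = np.asarray(z)
    ones = z.sum()
    return a + ones, b + z.size - ones

# Two batches of binary atom-selection indicators (hypothetical data).
batch1 = [1, 0, 1, 1]
batch2 = [0, 0, 1]

# Sequential: the posterior after batch1 becomes the prior for batch2.
a_seq, b_seq = update_beta(*update_beta(1.0, 1.0, batch1), batch2)
# Batch: all observations processed at once.
a_all, b_all = update_beta(1.0, 1.0, batch1 + batch2)
```

Both routes give the same posterior parameters, which is the sense in which a conjugate Bayesian model can absorb additional training data without revisiting the earlier batches.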


TABLE IV: Computation time for training and testing on the Caltech-101 database.

Method            Training (sec)    Testing (sec)
Proposed          1474              19.96
D-KSVD [7]        3196              19.90
LC-KSVD1 [9]      5434              19.65
LC-KSVD2 [9]      5434              19.92



Fig. 7: Example images from eight different categories in the Fifteen Scene Categories dataset [4].


D. Fifteen Scene Category 

The Fifteen Scene Category dataset [4] has 200 to 400 images per category for fifteen different kinds of scenes. The scenes include images of kitchens, living rooms, countryside, etc. In our experiments, we used the Spatial Pyramid Features of the images, which have been made public by Jiang et al. [9]. In this data, each feature descriptor is a 3000-dimensional vector. Using these features, we performed experiments by randomly selecting 100 training instances per class and considering the remaining as the test instances.

Classification accuracy of the proposed approach is compared with the existing approaches in Table V. The reported mean values are computed over ten experiments. We set the error tolerance for SRC to $10^{-6}$ and used the parameter settings suggested by Jiang et al. [9] for LC-KSVD1, LC-KSVD2 and D-KSVD. Parameters of DL-COPAR were set as suggested in the original work [11] for the same database. The reported results are obtained by LC for DL-COPAR. Again, the proposed approach shows more accurate results than the existing approaches. The accuracy of the proposed approach is 1.66 percentage points higher than that of LC-KSVD2 on this dataset.


E. UCF Sports Action 

This database comprises video sequences that are collected from different broadcast sports channels (e.g. ESPN and BBC) [5]. The videos contain 10 categories of sports actions that include: kicking, golfing, diving, horse riding, skateboarding, running, swinging, swinging highbar, lifting and walking. Examples from this dataset are shown in Fig. 8. Under the common evaluation protocol we performed fivefold cross validation over the dataset, where four folds are used in training and the remaining one is used for testing. Results, computed as the average of the five experiments, are summarized in Table VI. For D-KSVD, LC-KSVD1 and LC-KSVD2 we followed [9] for the parameter settings. Again, the value of $10^{-6}$ (along with similar small values) resulted in the best accuracies for SRC.

TABLE V: Classification accuracy on Fifteen Scene Category dataset [4] using Spatial Pyramid Features.

Method            Accuracy %
K-SVD [6]         93.60 ± 0.14
LC-KSVD1 [9]      94.05 ± 0.17
D-KSVD [7]        96.11 ± 0.12
SRC [8]           96.21 ± 0.09
DL-COPAR [11]     96.91 ± 0.22
LC-KSVD2 [9]      97.01 ± 0.23
Proposed          98.67 ± 0.19


TABLE VI: Classification rates on UCF Sports Action dataset [5].

Method              Accuracy %
Qiu et al. [33]     83.6
D-KSVD [7]          89.1
LC-KSVD1 [9]        89.6
DL-COPAR [11]       90.7
Sadanand [40]       90.7
LC-KSVD2 [9]        91.5
DLSI [12]           92.1
SRC [8]             92.7
FDDL [10]           93.6
LDL [13]            95.0
Proposed            95.1





In Table VI, the results for some specific action recognition methods are also included, for instance, Qiu et al. [33] and action bank features with SVM [40]. These results are taken directly from [13], along with the results of DLSI [12], DL-COPAR [11] and FDDL [10] (see footnote 5). Following [40], we also performed leave-one-out cross validation on this database for the proposed approach. Our approach achieves 95.7% accuracy under this protocol, which is 0.7% better than the state-of-the-art results claimed in [40].


VI. Discussion 

In our experiments, we chose the values of $K$, $a_o$ and $b_o$ in light of the theoretical results presented in Section IV-B. By setting $K > N$ we make sure that $K$ is very large. The results mainly remain insensitive to other similar large values of this parameter. The chosen values of $a_o$ and $b_o$ ensure that $0 < a_o, b_o < |\mathcal{I}_c|$. We used large values for $\lambda_{\epsilon_o}$ in our experiments as this parameter represents the precision of the white noise distribution in the samples. The datasets used in our experiments are mainly clean in terms of white noise. Therefore, we achieved the best performance with $\lambda_{\epsilon_o} \geq 10^6$. In the case of noisy data, this parameter value can be adjusted accordingly. For the UCF sports action dataset, $\lambda_{\epsilon_o} = 10^9$ gave the best results because fewer training samples were available per class. It should be noted that the value of $\lambda_\epsilon$ increases as a result of Bayesian inference with the availability of more clean training samples. Therefore, we adjusted the precision parameter of the prior distribution to a larger value for the UCF dataset. Among the other parameters, $c_o$ to $f_o$ were fixed to $10^{-6}$. Similar small non-negative values can also be used without affecting the results. This fact can be easily verified by noticing the large values of the other variables involved in equations (12) and (13), where these parameters are used.

With the above mentioned parameter settings and the initialization procedure presented in Section IV-D, the Gibbs sampling process converges quickly to the desired distributions and the correct number of dictionary atoms, i.e. $|\mathcal{K}|$. In Fig. 9 we plot the value of $|\mathcal{K}|$ as a function of Gibbs sampling iterations during dictionary training. Each plot represents a complete training process for one dataset. It can be easily seen that the first 10 iterations of the Gibbs sampling process were enough to infer the correct size of the dictionary. However, it should be mentioned that this fast convergence also owes to the initialization process adopted in this work. In our experiments, while sparse coding a test instance over the learned dictionary, we consistently used the sparsity threshold of 50 for all the datasets except for UCF [5], for which this parameter was set to 40 because of the smaller dictionary resulting from fewer training samples. In all the experiments, these values were also kept the same for K-SVD, LC-KSVD1, LC-KSVD2 and D-KSVD for fair comparisons.

5 The results of DL-COPAR [11] and FDDL [10] are taken directly from the literature because the optimized parameter values for these algorithms are not previously reported for this dataset. Our parameter optimization did not outperform the reported accuracies.


VII. Conclusion 

We proposed a non-parametric Bayesian approach for learning discriminative dictionaries for sparse representation of
data. The proposed approach employs a Beta process to infer 
a discriminative dictionary and sets of Bernoulli distributions 
associating the dictionary atoms to the class labels of the 
training data. The said association is adaptively built during 
Bayesian inference and it signifies the selection probabilities 
of dictionary atoms in the expansion of class-specific data. The 
inference process also results in computing the correct size of 
the dictionary. For learning the discriminative dictionary, we 
presented a hierarchical Bayesian model and the corresponding 
inference equations for Gibbs sampling. The proposed model 
is also exploited in learning a linear classifier that finally 
classifies the sparse codes of a test instance that are learned 
using the inferred discriminative dictionary. The proposed 
approach is evaluated for classification using five different 
databases of human face, human action, scene category and 
object images. Comparisons with state-of-the-art discriminative sparse representation approaches show that the proposed
Bayesian approach consistently outperforms these approaches 
and has computational efficiency close to the most efficient 
approach. 

Fig. 9: Size of the inferred dictionary, i.e. $|\mathcal{K}|$, as a function of the Gibbs sampling iterations. Each plot represents a complete training process for a given dataset.

Whereas its effectiveness in terms of accuracy and computation is experimentally demonstrated in this work, there are also other key advantages that make our Bayesian approach to discriminative sparse representation much more appealing than the existing optimization based approaches. Firstly, the Bayesian framework allows us to learn an ensemble of discriminative dictionaries in the form of probability distributions instead of the point estimates that are learned by the optimization based approaches. Secondly, it provides a principled approach to estimate the required dictionary size and we can associate the dictionary atoms and the class labels in a physically meaningful manner. Thirdly, the Bayesian framework makes our approach inherently an online technique. Furthermore, the Bayesian framework also provides an opportunity of using domain/class-specific prior knowledge in our approach in a principled manner. This can prove beneficial in many applications. For instance, while classifying the spectral signatures of minerals on pixel and sub-pixel level in remote-sensing hyperspectral images, the relative smoothness of spectral signatures [65] can be incorporated in the inferred discriminative bases. For this purpose, Gaussian Processes [66] can be used as a base measure for the Beta Process. Adapting the proposed approach for remote-sensing hyperspectral image classification is also our future research direction.